WB Project PDO text analysis

Author

Luisa M. Mimmi

Published

October 2, 2024

Work in progress

Set up

# Pckgs -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(readxl) # Read Excel Files
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
library(SnowballC) # Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
library(rsample) # General Resampling Infrastructure
library(rvest) # Easily Harvest (Scrape) Web Pages
library(cleanNLP) # A Tidy Data Model for Natural Language Processing
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax

cleanNLP supports multiple backends for processing text, such as CoreNLP, spaCy, udpipe, and stanza. Each of these backends has different capabilities and might require different initialization procedures.

  • CoreNLP ~ powerful Java-based NLP toolkit developed by Stanford, which includes many linguistic tools like tokenization, part-of-speech tagging, and named entity recognition.
    • ❕❗️ NEEDS EXTERNAL INSTALLATION (must be installed in Java with cnlp_install_corenlp(), which installs the Java JAR files and models)
  • spaCy ~ fast and modern NLP library written in Python. It provides advanced features like dependency parsing, named entity recognition, and tokenization.
      • ❕❗️ NEEDS EXTERNAL INSTALLATION (must be installed in Python with spacy_install(), which installs both spaCy and the necessary Python dependencies; the spacyr R package must also be installed to interface with it)
  • udpipe ~ R package that provides bindings to the UDPipe NLP toolkit. Fast, lightweight and language-agnostic NLP library for tokenization, part-of-speech tagging, lemmatization, and dependency parsing.
  • stanza ~ another modern NLP library from Stanford, similar to CoreNLP but built on PyTorch; supports 66+ languages.

When you initialize a backend (like CoreNLP) in cleanNLP, it stays active for the entire session unless you explicitly reinitialize or change it.

# ---- 1) Initialize the CoreNLP backend
library(cleanNLP)
cnlp_init_corenlp()
# If you want to specify a language or model path:
cnlp_init_corenlp(language = "en", 
                  # model_path = "/path/to/corenlp-models"
                  )

# ---- 2) Initialize the spaCy backend 
library(cleanNLP)
library(spacyr)
# Initialize spaCy in cleanNLP
cnlp_init_spacy()
# Optional: specify language model
cnlp_init_spacy(model_name = "en_core_web_sm")

# ---- 3) Initialize the udpipe backend
library(cleanNLP)
# Initialize udpipe backend
cnlp_init_udpipe(model_name = "english")

# ---- 4) Initialize the stanza backend
# (As of this writing, cleanNLP does not appear to ship a dedicated stanza
# initializer; stanza would need to be called from R via reticulate.)

—————————————————————————-

Data sources

WB Projects & Operations [CHECK 🔴]

World Bank Projects & Operations can be explored at:

  1. Data Catalog, from which:
  2. Advanced Search

—————————————————————————

Load pre-processed Projs’ PDO dataset pdo_train_t

Syntactic annotation is a computationally expensive operation, so I don’t want to repeat it every time I restart the session.

[Saved file projs_train_t]

Done in analysis/_01a_WB_project_pdo_prep.qmd

  1. I manually retrieved ALL WB projects approved between FY 1947 and 2026, as of 31/08/2024, simply using the Excel button on this page: WBG Projects
    • By the way, this is the link “list-download-excel”
    • then saved the huge .xls file in data/raw_data/project2/all_projects_as_of29ago2024.xls
      • (plus a Rdata copy of the original file)
  2. Split the dataset and keep only projs_train (50% of projects with PDO text, i.e. 4413 PDOs)
  3. Clean the dataset and save projs_train_t (cleaned train dataset)
  4. Obtain PoS tagging + tokenization with the cleanNLP package (functions cnlp_init_udpipe() + cnlp_annotate()) and save projs_train_t (cleaned train dataset).
# Load clean Proj PDO train dataset `pdo_train_t`
pdo_train_t <- readRDS(here::here("data" , "derived_data", "pdo_train_t.rds"))

Explain Tokenization and PoS Tagging

i) Tokenization

Breaking units of language into components relevant for the research question is called “tokenization”. Components can be words, ngrams, sentences, etc.; tokenization can also combine smaller units into larger ones.

  • Tokenization is a row-wise operation: it changes the number of rows in the dataset.
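As a minimal base-R sketch (the sentence below is invented, not from the PDO data), tokenizing one string turns a single row of text into one row per word:

```r
# One PDO-like string (hypothetical example, not from the dataset)
pdo <- "The objective is to improve access to basic services."

# Lower-case, strip punctuation, split on whitespace: one element per token
tokens <- tolower(unlist(strsplit(gsub("[[:punct:]]", "", pdo), "\\s+")))

length(tokens)  # 9 tokens from a single input string
```

tidytext’s unnest_tokens() performs the same operation (including lower-casing and punctuation stripping) directly on a data frame column.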

The choices involved in tokenization:

  1. Should words be lower cased?
  2. Should punctuation be removed?
  3. Should numbers be replaced by some placeholder?
  4. Should words be stemmed (or, relatedly, lemmatized)? ☑️
  5. Should bigrams/multi-word phrases be used instead of single words?
  6. Should stopwords (the most common words) be removed? ☑️
  7. Should rare words be removed?
  8. Should hyphenated words be split into two words? ❌

For the moment, I keep everything as conservative as possible.

ii) PoS Tagging

Linguistic annotation is a common form of enriching text data, i.e. adding information about the text that is not directly present in the text itself.

Building on this (e.g. classifying nouns, verbs, adjectives, etc.), one can discover intent or action in a sentence, or scan for “verb-noun” patterns.

Here I have a training dataset file with:

| Variable | Type | Provenance | Description |
|---|---|---|---|
| proj_id | chr | original PDO data | |
| pdo | chr | original PDO data | |
| word_original | chr | original PDO data | |
| sid | int | output cleanNLP | Sentence ID |
| tid | chr | output cleanNLP | Token ID within sentence |
| token | chr | output cleanNLP | Tokenized form of the token |
| token_with_ws | chr | output cleanNLP | Token with trailing whitespace |
| lemma | chr | output cleanNLP | The base form of the token |
| upos | chr | output cleanNLP | Universal part-of-speech tag (e.g., NOUN, VERB, ADJ) |
| xpos | chr | output cleanNLP | Language-specific part-of-speech tag |
| feats | chr | output cleanNLP | Morphological features of the token |
| tid_source | chr | output cleanNLP | Token ID in the source document |
| relation | chr | output cleanNLP | Dependency relation between the token and its head token |
| pr_name | chr | output cleanNLP | Name of the parent token |
| FY_appr | dbl | original PDO data | |
| FY_clos | dbl | original PDO data | |
| status | chr | original PDO data | |
| regionname | chr | original PDO data | |
| countryname | chr | original PDO data | |
| sector1 | chr | original PDO data | |
| theme1 | chr | original PDO data | |
| lendinginstr | chr | original PDO data | |
| env_cat | chr | original PDO data | |
| ESrisk | chr | original PDO data | |
| curr_total_commitment | dbl | original PDO data | |

— PoS Tagging: upos (Universal Part-of-Speech)

| upos | n | percent | explanation |
|---|---|---|---|
| ADJ | 21844 | 0.0854155 | Adjective |
| ADP | 27819 | 0.1087793 | Adposition |
| ADV | 3009 | 0.0117659 | Adverb |
| AUX | 3736 | 0.0146087 | Auxiliary |
| CCONJ | 14488 | 0.0566517 | Coordinating conjunction |
| DET | 22121 | 0.0864987 | Determiner |
| INTJ | 57 | 0.0002229 | Interjection |
| NOUN | 72616 | 0.2839469 | Noun |
| NUM | 2282 | 0.0089232 | Numeral |
| PART | 8848 | 0.0345979 | Particle |
| PRON | 2347 | 0.0091774 | Pronoun |
| PROPN | 14867 | 0.0581337 | Proper noun |
| PUNCT | 29365 | 0.1148245 | Punctuation |
| SCONJ | 2219 | 0.0086768 | Subordinating conjunction |
| SYM | 317 | 0.0012395 | Symbol |
| VERB | 26398 | 0.1032228 | Verb |
| X | 3405 | 0.0133144 | Other |

On random visual check, these are not always correct, but they are a good starting point for now.

iii) Make lower case

pdo_train_t <- pdo_train_t %>% 
  mutate(token_l = tolower(token)) %>% 
   relocate(token_l, .after = token) %>% 
   select(-token_with_ws) %>%
  #Replace variations of "hyphenword" with "-"
  mutate(
    lemma = str_replace_all(lemma, regex("hyphenword|hyphenwor", 
                                         ignore_case = TRUE), "-")
  )

iv) Stemming

Using SnowballC::wordStem() to stem the words, e.g.:

pdo_train_t <- pdo_train_t %>% 
   mutate(stem = SnowballC::wordStem(token_l)) %>%
   relocate(stem, .after = lemma)

Why stemming? In topic modeling, for example, stemming reduces noise by making it easier for the model to identify core topics without being distracted by grammatical variations. (Lemmatization is more computationally intensive, as it requires linguistic context and dictionaries, making it slower, especially on large datasets.)

| Token | Lemma | Stem |
|---|---|---|
| development | development | develop |
| quality | quality | qualiti |
| high-quality | high-quality | high-qual |
| include | include | includ |
| logistics | logistic | logist |
| government/governance | government/governance | govern |

NOTE: Among the words/stems encountered in PDOs there are a lot of acronyms, which may refer to World Bank lingo, local agencies, etc. Especially when viewed in lower-case form, they don’t make much sense.

v) Document-term matrix or TF-IDF

The tf-idf is the product of the term frequency and the inverse document frequency:

\[ \begin{aligned} tf(\text{term}) &= \frac{n_{\text{term}}}{n_{\text{terms in document}}} \\ idf(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\ tf\text{-}idf(\text{term}) &= tf(\text{term}) \times idf(\text{term}) \end{aligned} \]
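A tiny worked instance of these formulas, using tidytext’s bind_tf_idf() on invented toy counts:

```r
library(dplyr)
library(tidytext)

# Toy term counts for two "documents" (purely illustrative)
toy_counts <- tibble(
  doc  = c("d1", "d1", "d2"),
  term = c("water", "access", "access"),
  n    = c(2, 1, 3)
)

toy_tfidf <- toy_counts |> bind_tf_idf(term, doc, n)
toy_tfidf
# "access" occurs in both documents -> idf = ln(2/2) = 0, so tf-idf = 0
# "water" occurs only in d1 -> tf = 2/3, idf = ln(2/1), tf-idf = 2/3 * ln(2)
```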

— My own custom_stop_words

Remove stop words, which are the most common words in a language.

  • but I don’t want to remove any meaningful word for now
# Custom list of articles, prepositions, and pronouns
custom_stop_words <- c(
   # Articles
   "the", "a", "an",   
   # Conjunctions and short function words
   "and", "but", "or", "yet", "so", "for", "nor", "as", "at", "by", "per",  
   # Prepositions
   "of", "in", "on", "at", "by", "with", "about", "against", "between", "into", "through", 
   "during", "before", "after", "above", "below", "to", "from", "up", "down", "under",
   "over", "again", "further", "then", "once",  
   # Pronouns
   "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
   "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", 
   "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves" ,
   "this", "that", "these", "those", "which", "who", "whom", "whose", "what", "where",
   "when", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other",
   # "some", "such", "no",  "not", 
   # "too", "very",   
   # verbs
   "is", "are", "would", "could", "will", "be"
)

# Convert to a data frame if needed for consistency with tidytext
custom_stop_words_df <- tibble(word = custom_stop_words)

— TF-IDF matrix on train pdo

# reduce size 

pdo_train_4_tf_idf <- pdo_train_t %>% # 255964
   # Keep only content words [very restrictive for now]
   # normally c("NOUN", "VERB", "ADJ", "ADV")
   filter(upos %in% c("NOUN")) %>% #    72,668 
   filter(!token_l %in% c("development", "objective", "project")) %>%   #  66,741
   # get rid of stop words (from default list)   
   filter(!token_l %in% custom_stop_words_df$word) %>%   #  66,704
   # Optional: Remove lemmas of length 1 or shorter
   filter(nchar(lemma) > 1)  #  66,350

Now, count the occurrences of each lemma for each document. (This is the term frequency or tf)

# This is the term frequency or `tf`

# Count lemmas per document
lemma_counts <- pdo_train_4_tf_idf %>%
  count(proj_id, lemma, sort = TRUE)
# Preview the result
head(lemma_counts) 

With the lemma counts prepared, the bind_tf_idf() function from the tidytext package computes the TF-IDF scores.

# Compute the TF-IDF scores
lemma_tf_idf <- lemma_counts %>%
  bind_tf_idf(lemma, proj_id, n) %>%
  arrange(desc(tf_idf))

What to use: token, lemma, or stem?

General Preference in Real-World NLP:

  • Tokens for analyses where word forms matter or for sentiment analysis.
  • Lemmas (*) for most general-purpose NLP tasks where you want to reduce dimensionality while maintaining accuracy and clarity of meaning.
  • Stems for very large datasets, search engines, and applications where speed and simplicity are more important than linguistic precision.

(*) I use lemma, after “aggressively” reducing the number of words to consider, and removing stop words (at least for now).

_______

TEXT ANALYSIS/SUMMARY

_______

Frequencies of documents/words/stems

We are looking at the (training data subset) pdo_train_t, which has 255738 rows and 26 columns, obtained from 4071 PDOs of 4413 World Bank projects approved in Fiscal Years ranging from 2001 to 2023.

| entity | counts |
|---|---|
| N proj | 4413 |
| N PDOs | 4071 |
| N words | 13197 |
| N token | 11367 |
| N lemma | 11440 |
| N stem | 8781 |

[FUNC] save plots

Term frequency

Note: normally, the most frequent words are function words (e.g. determiners, prepositions, pronouns, and auxiliary verbs), which are not very informative. Moreover, even content words (e.g. nouns, verbs, adjectives, and adverbs) can often be quite generic semantically speaking (e.g. “good” may be used for many different things).

In this analysis, I do not use the STOPWORD approach; instead, I use the POS tags to reduce the dataset to just the content words, i.e. nouns, verbs, adjectives, and adverbs.

[FIG] Overall token freq ggplot

  • Excluding “project” “develop”,“objective”
  • Including only “content words” (NOUN, VERB, ADJ, ADV)
# Evaluate the title with glue first
title_text <- glue::glue("Most frequent token in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}") 

pdo_wrd_freq <- pdo_train_t %>%   # 123,927
   # include only content words
   filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
   #filter (!(upos %in% c("AUX","CCONJ", "INTJ", "DET", "PART","ADP", "SCONJ", "SYM", "PART", "PUNCT"))) %>%
   filter (!(relation %in% c("nummod" ))) %>% # 173,686 
 filter (!(token_l %in% c("pdo","project", "development", "objective","objectives", "i", "ii", "iii",
                          "is"))) %>% # "is" when it is tagged as VERB
   count(token_l) %>% 
   filter(n > 800) %>% 
   mutate(token_l = reorder(token_l, n)) %>%  # reorder values by frequency
   # plot 
   ggplot(aes(token_l, n)) +
   geom_col(fill = "gray") +
   coord_flip() + # flip x and y coordinates so we can read the words better
   labs(title = title_text,
        subtitle = "[token_l count > 800]", y = "", x = "")+
  theme(plot.title.position = "plot")

pdo_wrd_freq

[FIG] Overall stem freq ggplot

  • Without “project” “develop”,“objective”
  • Including only “content words” (NOUN, VERB, ADJ, ADV)
# Evaluate the title with glue first
title_text <- glue::glue("Most frequent STEM in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}") 
# Plot
pdo_stem_freq <- pdo_train_t %>%   # 256,632
   # include only content words
   filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
   filter (!(relation %in% c("nummod" ))) %>% # 173,686 
 filter (!(stem %in% c("pdo","project", "develop", "object", "i", "ii", "iii"))) %>%
   count(stem) %>% 
   filter(n > 800) %>% 
   mutate(stem = reorder(stem, n)) %>%  # reorder values by frequency
   # plot 
   ggplot(aes(stem, n)) +
   geom_col(fill = "gray") +
   coord_flip() + # flip x and y coordinates so we can read the words better
   labs(title = title_text,
        subtitle = "[stem count > 800]", y = "", x = "") +
  theme(plot.title.position = "plot")

pdo_stem_freq

Evidently, after stemming, more words (or stems) reach the threshold frequency count of 800.

_______

_______

Create bigrams

Here I use the cnlp_annotate() output + dplyr to combine consecutive tokens into bigrams.

# Create bigrams by pairing consecutive tokens by sentence ID and token IDs
bigrams <- pdo_train_t %>%
   # keeping FY with tokens
   group_by(FY_appr, proj_id, pdo, sid ) %>%
   arrange(tid) %>%
   # Using mutate() and lead(), we create bigrams from consecutive tokens 
   mutate(next_token = lead(token), 
          bigram = paste(token, next_token)) %>%
   # make bigram low case
   mutate(bigram = tolower(bigram)) %>%
   # only includes the rows where valid bigrams are formed
   filter(!is.na(next_token)) %>%
   ungroup() %>%
   arrange(FY_appr, proj_id, sid, tid) %>%
   select(FY_appr,proj_id, pdo,sid, tid, token, bigram) 
# most frequent bigrams 
count_bigram <- bigrams %>% 
   count(bigram, sort = TRUE)  
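For comparison, a sketch of the same bigram count using tidytext’s n-gram tokenizer directly on the raw PDO text (this bypasses the cleanNLP token IDs, so results can differ slightly; the mini-corpus below is invented):

```r
library(dplyr)
library(tidytext)

# Invented mini-corpus standing in for pdo_train_t's `pdo` column
toy_pdo <- tibble(proj_id = c("P1", "P2"),
                  pdo = c("improve access to basic services",
                          "improve access to finance"))

toy_bigrams <- toy_pdo |>
  unnest_tokens(bigram, pdo, token = "ngrams", n = 2) |>
  count(bigram, sort = TRUE)

toy_bigrams
# "access to" and "improve access" each appear in both documents (n = 2)
```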

Clean bigrams

The issue here is to clean the bigrams without losing the pairing of consecutive words, so I use a split-reunite process: separate each bigram into its two component words, remove pairs where either word is a stopword or contains punctuation, then reunite them.

# Separate the bigram column into two words
bigrams_cleaned <- bigrams %>%
  tidyr::separate(bigram, into = c("word1", "word2"), sep = " ")

# Remove stopwords and bigrams in EACH component word containing punctuation
bigrams_cleaned <- bigrams_cleaned %>%
   # custom stop words
   filter(!word1 %in% custom_stop_words_df$word, !word2 %in% custom_stop_words_df$word) %>% 
   # Remove punctuation   
   filter(!str_detect(word1, "[[:punct:]]"), !str_detect(word2, "[[:punct:]]"))  

# Reunite the component cleaned words into the bigram column
bigrams_cleaned <- bigrams_cleaned %>%
   unite(bigram, word1, word2, sep = " ") %>% 
   # Remove too obvious bigrams 
   filter(!bigram %in% c("development objective", "development objectives", 
                         "proposed project", "project development", "program development"))

# View the cleaned dataframe
bigrams_cleaned

# Count the frequency of each bigram
bigram_freq <- bigrams_cleaned %>%
  count(bigram, sort = TRUE)

[FIG] most frequent bigrams in PDOs

  • Excluding bigrams where 1 word is among stopwords or a punctuation sign
  • Excluding “development objective/s”, “proposed project”, “program development” because not very informative
# ---- Prepare data for plotting
# Evaluate the title with glue first
title_text <- glue::glue("Frequency of bigrams in PDOs over FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}") 
# Define the bigrams you want to highlight
bigrams_to_highlight <- c("public sector", "private sector", "eligible crisis")   

 
# ---- Plot the most frequent bigrams
pdo_bigr_freq <- bigram_freq %>%
   slice_max(n, n = 25) %>%
   ggplot(aes(x = reorder(bigram, n), y = n,
              fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
   geom_col() +
   # coord flipped so n is Y axis
   scale_y_continuous(breaks = seq(min(bigram_freq$n)-1, max(bigram_freq$n), by = 50)) +
   scale_fill_manual(values = c("public sector" = "#005ca1", 
                                "private sector" = "#9b2339", 
                                "eligible crisis"= "#8e550a", "Other" = "grey")) +
   guides(fill = "none") +
   coord_flip() +
   labs(title = title_text, subtitle = "(ranking top 25 bigrams)",
        x = "", y = "") +
   theme(plot.title.position = "plot",
         axis.text.y = element_text(
            # obtain vector of colors 2 match x axis labels color to fill
            color = bigram_freq %>%
               slice_max(n, n = 25) %>%
               # mutate(color = ifelse(bigram %in% bigrams_to_highlight,
               #                       ifelse(bigram == "public sector", "#005ca1",
               #                              ifelse(bigram == "private sector", "#9b2339", "#8e550a")),
               #                       "#4c4c4c")) 
               mutate(color = case_when (
                  bigram == "public sector" ~ "#005ca1",
                  bigram == "private sector" ~ "#9b2339",
                  bigram == "eligible crisis" ~ "#8e550a",
                  TRUE ~ "#4c4c4c")) %>%
               # Ensure the order matches the reordered bigrams (AS BINS)
               arrange(reorder(bigram, n)) %>%  
               # Extract the color column in bin order as vector to be passed to element_text()
               pull(color)
            )
         )

pdo_bigr_freq

Results are not surprising in terms of frequent combinations like “increase access”, “institutional capacity”, “poverty reduction”. Interestingly, while “health” recurred in several bigrams among the top 25 (e.g. “health services”, “public health”, “health care”), “education” did not appear at all.

Perhaps a bit mysterious is “eligible crisis” (> 100 mentions)?!

[FIG] Changes over time BY 1FY

On the other hand, “climate change” appears in the top 25 (ranking above “financial sector” and “capacity building”) which begs the question of whether the frequency of these bigrams has changed over time.

## too busy to be useful
# Step 1: Count the frequency of each bigram by year
top_bigrams_FY <- bigrams_cleaned %>%
   group_by(FY_appr, bigram) %>%
   summarise(count = n(), .groups = 'drop') %>%
   arrange(FY_appr, desc(count)) %>%
   # ---  +/- top 10  
   group_by(FY_appr) %>%
   top_n(10, count) %>%
   ungroup()
   # # ---  STRICT  top 10  
   # mutate(rank = dense_rank(desc(count))) %>%  # Rank bigrams by frequency
   # filter(rank <= 10) %>%  # Keep only the top 10 by rank
   # ungroup()

  
# Add specific bigrams to highlight, if any
bigrams_to_highlight <- c("climate change",  "climate resilience", "public sector", "private sector")

# Step 2: Plot the top bigrams by frequency over time   
pdo_bigr_FY_freq  <-  top_bigrams_FY %>% 
 ggplot(aes(x = reorder(bigram, count), 
             y = count,
             fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
  geom_col() +
  scale_fill_manual(values = c("public sector" = "#005ca1", "private sector" = "#e60066", 
                               "climate change" = "#399B23", "climate resilience" = "#d8e600",
                               "Other" = "grey")) +
  guides(fill = "none") +
  coord_flip() +
  facet_wrap(~ FY_appr, scales = "free_y") +
  labs(title = "Top 10 Bigrams by Frequency Over Time",
       subtitle = "(Faceted by Fiscal Year Approval)",
       x = "Bigrams",
       y = "Count") +
  theme_minimal() +
  theme(plot.title.position = "plot",
        axis.text.x = element_text(angle = 45, hjust = 1))

pdo_bigr_FY_freq

[FIG] Changes over time BY 3FY

# generate FY group 
f_generate_year_groups <- function(years, interval) {
  breaks <- seq(floor(min(years, na.rm = TRUE) / interval) * interval, 
                ceiling(max(years, na.rm = TRUE) / interval) * interval, 
                by = interval)
  
  labels <- paste(breaks[-length(breaks)], "-", breaks[-1] - 1)
  
  return(list(breaks = breaks, labels = labels))
}
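A quick sanity check of this helper (the function is repeated so the snippet is self-contained):

```r
# Group fiscal years into fixed-width intervals and label them
f_generate_year_groups <- function(years, interval) {
  breaks <- seq(floor(min(years, na.rm = TRUE) / interval) * interval,
                ceiling(max(years, na.rm = TRUE) / interval) * interval,
                by = interval)
  labels <- paste(breaks[-length(breaks)], "-", breaks[-1] - 1)
  list(breaks = breaks, labels = labels)
}

yg <- f_generate_year_groups(2001:2007, interval = 3)
yg$breaks  # 2001 2004 2007
yg$labels  # "2001 - 2003" "2004 - 2006"
```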
# --- Step 1: Create n-year groups (using `f_generate_year_groups`)
interval_i = 3 # decide the interval
year_groups <- f_generate_year_groups(bigrams_cleaned$FY_appr, interval = interval_i)
top_n_i = 12 # decide the top n bigrams to show

# --- Step 2: Add the generated FY breaks and labels to data frame
top_bigrams_FY <- bigrams_cleaned %>%
   # cut divides the range of x into intervals
   mutate(FY_group = base::cut(FY_appr, 
                               breaks = year_groups$breaks, 
                               include.lowest = TRUE, 
                               right = FALSE, 
                               labels = year_groups$labels)) %>% 
   # Count the frequency of each bigram by n-year groups
   group_by(FY_group, bigram) %>%
   summarise(count = n(), .groups = 'drop') %>%
   arrange(FY_group, desc(count)) %>%
   # Top ? bigrams for each n-year period
   group_by(FY_group) %>%
   top_n(top_n_i, count) %>%
   ungroup()

# --- Step 3: Add specific bigrams to highlight, if any
bigrams_to_highlight <- c("climate change",  "climate resilience", 
                          "eligible crisis",  
                          "public sector", "private sector")

# --- Step 4: Plot the top bigrams by frequency over n-year periods
pdo_bigr_FY_freq  <-  top_bigrams_FY %>% 
 ggplot(aes(x = reorder(bigram, count), 
             y = count,
             fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
  geom_col() +
  scale_fill_manual(values = c("public sector" = "#005ca1", 
                               "private sector" = "#e60066", 
                               "climate change" = "#399B23", 
                               "climate resilience" = "#d8e600",
                               "eligible crisis" = "#e68000",  
                               "Other" = "grey")) +
  guides(fill = "none") +
  coord_flip() +
  facet_wrap(~ FY_group, ncol = 3 , scales = "free_y" )+ 
              #strip.position = "top") +  # Facet wrap with columns
  labs(title = glue::glue("Top {top_n_i} Bigrams by Frequency Over n-Year Periods"),
       subtitle = glue::glue("(Faceted by {interval_i}-Year Groups)"),
       x = "Bigrams",
       y = "Count") +
  theme_minimal() +
  theme(plot.title.position = "plot") 

# print the plot
pdo_bigr_FY_freq

Frequency observed over FY intervals is very revealing. For example, “private sector” and “public sector” lose importance over time (around the mid-2010s), while “climate change” and “climate resilience” gain relevance from that same point on.

Still quite surprising is the bigram “eligible crisis”, which actually appears in the top 12 bigrams starting in FY 2013-2015!

🤔 Which are the most frequent and persistent Bigrams Over Time?

For this, I am looking for a ranking that considers the mean frequency across periods (arrange(desc(mean_count))) plus stability (low standard deviation) across periods [this is hard because of NAs], and NOT the overall total count.

  • Using top_bigrams_FY which had breaks of 3FY
# [REPEATED just to see the table]
# --- Step 1: Create n-year groups (using `f_generate_year_groups`)
interval_i = 3 # decide the interval
year_groups <- f_generate_year_groups(bigrams_cleaned$FY_appr, interval = interval_i)
top_n_i = 12 # decide the top n bigrams to show

# --- Step 2: Add the generated FY breaks and labels to data frame
top_bigrams_FY <- bigrams_cleaned %>%
   # cut divides the range of x into intervals
   mutate(FY_group = base::cut(FY_appr, 
                               breaks = year_groups$breaks, 
                               include.lowest = TRUE, 
                               right = FALSE, 
                               labels = year_groups$labels)) %>% 
   # Count the frequency of each bigram by n-year groups
   group_by(FY_group, bigram) %>%
   summarise(count = n(), .groups = 'drop') %>%
   arrange(FY_group, desc(count)) %>%
   # Top ? bigrams for each n-year period
   group_by(FY_group) %>%
   top_n(top_n_i, count) %>%
   ungroup()
# Calculate the mean frequency and standard deviation of the counts for each bigram across periods
stable_and_frequent_bigrams <- top_bigrams_FY %>%
   group_by( bigram) %>%
   summarise(mean_count = mean(count, na.rm = TRUE),     # Mean frequency across periods
             sd_count = sd(count, na.rm = TRUE),         # Stability (lower sd = more stable)
             total_count = sum(count)) %>%               # Total count across all periods (optional)
   arrange(desc(mean_count)) %>%                      # Sort by frequency and then stability
   # Filter out bigrams with low mean frequency or high instability (you can adjust thresholds)
   # Focus on the most frequent bigrams (above the 70th percentile of mean_count)
   filter(mean_count > quantile(mean_count, 0.70, na.rm = TRUE)) #%>% 
   # Focus on the most stable 50% (lower sd) ---> NO bc NA values
   #filter( sd_count < quantile(sd_count, 0.5, na.rm = TRUE))

[TBL] Bigrams Over Time [3FY]

# View the most frequent and stable bigrams
stable_and_frequent_bigrams %>% 
   slice_head(n = 15)  %>% kableExtra::kable()
| bigram | mean_count | sd_count | total_count |
|---|---|---|---|
| increase access | 36.28571 | 9.604067 | 254 |
| threat posed | 33.00000 | NA | 33 |
| private sector | 32.40000 | 12.340989 | 162 |
| health preparedness | 31.00000 | NA | 31 |
| eligible crisis | 30.75000 | 11.441882 | 123 |
| poverty reduction | 30.25000 | 14.430870 | 121 |
| climate change | 28.50000 | 4.949747 | 57 |
| public sector | 28.50000 | 8.698659 | 114 |
| strengthen national | 28.00000 | NA | 28 |
| mobile applications | 27.00000 | NA | 27 |
| improve access | 26.33333 | 8.430105 | 158 |
| health care | 26.00000 | NA | 26 |
| service delivery | 25.71429 | 3.683942 | 180 |
| public health | 25.50000 | 16.263456 | 51 |
| national systems | 24.00000 | NA | 24 |
  • Using top_bigrams_FY2 which had breaks of 1FY
# --- Step 1: Create n-year groups (using `f_generate_year_groups`)
interval_i = 1 # decide the interval
year_groups <- f_generate_year_groups(bigrams_cleaned$FY_appr, interval = interval_i)
top_n_i = 12 # decide the top n bigrams to show

# --- Step 2: Add the generated FY breaks and labels to data frame
top_bigrams_FY2 <- bigrams_cleaned %>%
   # cut divides the range of x into intervals
   mutate(FY_group = base::cut(FY_appr, 
                               breaks = year_groups$breaks, 
                               include.lowest = TRUE, 
                               right = FALSE, 
                               labels = year_groups$labels)) %>% 
   # Count the frequency of each bigram by n-year groups
   group_by(FY_group, bigram) %>%
   summarise(count = n(), .groups = 'drop') %>%
   arrange(FY_group, desc(count)) %>%
   # Top ? bigrams for each n-year period
   group_by(FY_group) %>%
   top_n(top_n_i, count) %>%
   ungroup()
# Calculate the mean frequency and standard deviation of the counts for each bigram across periods
stable_and_frequent_bigrams2 <- top_bigrams_FY2 %>%
   group_by( bigram) %>%
   summarise(mean_count = mean(count, na.rm = TRUE),     # Mean frequency across periods
             sd_count = sd(count, na.rm = TRUE),         # Stability (lower sd = more stable)
             total_count = sum(count)) %>%               # Total count across all periods (optional)
   arrange(desc(mean_count)) %>%                      # Sort by frequency and then stability
   # Filter out bigrams with low mean frequency or high instability (you can adjust thresholds)
   # Focus on the most frequent bigrams (above the 70th percentile of mean_count)
   filter(mean_count > quantile(mean_count, 0.70, na.rm = TRUE)) #%>% 
   # Focus on the most stable 50% (lower sd) ---> NO bc NA values
   #filter( sd_count < quantile(sd_count, 0.5, na.rm = TRUE))

[TBL] Bigrams Over Time [1FY]

# View the most frequent and stable bigrams
stable_and_frequent_bigrams2 %>% 
   slice_head(n = 15)   %>% kableExtra::kable()
| bigram | mean_count | sd_count | total_count |
|---|---|---|---|
| mobile applications | 27.00000 | NA | 27 |
| public health | 16.66667 | 3.0550505 | 50 |
| threat posed | 16.50000 | 2.1213203 | 33 |
| health preparedness | 15.50000 | 0.7071068 | 31 |
| slum upgrading | 15.00000 | NA | 15 |
| increase access | 14.52941 | 5.3632738 | 247 |
| strengthen national | 14.00000 | 2.8284271 | 28 |
| eligible crisis | 13.33333 | 10.0747208 | 120 |
| respond promptly | 13.00000 | 9.8994949 | 26 |
| vulnerable households | 13.00000 | NA | 13 |
| congo basin | 12.00000 | NA | 12 |
| national systems | 12.00000 | 1.4142136 | 24 |
| proposed operation | 12.00000 | NA | 12 |
| poverty reduction | 11.88889 | 5.1585958 | 107 |
| climate resilience | 11.66667 | 4.5092498 | 35 |

_______

Explore bigrams

_______

>>>>>> HERE <<<<<<<<<<<<<<<<<<

Main references:

  • https://www.nlpdemystified.org/course/advanced-preprocessing
  • review what I had done to clean text in analysis/03_WDR_pdotracs_explor.qmd
  • https://cengel.github.io/R-text-analysis/textprep.html#detecting-patterns
  • https://guides.library.upenn.edu/penntdm/r
  • https://smltar.com/stemming#how-to-stem-text-in-r (BOOK, chapter on stemming)

Isolate other BIGRAM frequency…

[FIG] Most frequent bigrams


… [FIG] Notable bigrams (climate change)!

Word and document frequency: Tf-idf

The goal is to quantify what a document is about.

  • term frequency (tf) = how frequently a word occurs in a document… but some words occur many times and are not important
  • term’s inverse document frequency (idf) = decreases the weight of commonly used words and increases the weight of words that are not used very much in a collection of documents.
  • statistic tf-idf = the frequency of a term adjusted for how rarely it is used; an alternative to using stopwords. [It measures how important a word is to a document in a collection (or corpus) of documents, but it is still a rule-of-thumb or heuristic quantity.]

N-Grams

Co-occurrence
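A co-occurrence count could be sketched with the widyr package (assumed available; the token table below is invented):

```r
library(dplyr)
library(widyr)  # pairwise_count()

# Invented toy token table standing in for the PDO tokens (doc_id + word)
toy <- tibble(doc_id = c(1, 1, 1, 2, 2),
              word   = c("water", "supply", "access", "water", "access"))

# How often does each pair of words co-occur within the same document?
co_occ <- toy |> pairwise_count(word, doc_id, sort = TRUE)
co_occ
# water/access co-occur in 2 documents; water/supply in only 1
```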

_______

TOPIC MODELING w ML

_______

Compare PDO text v. project METADATA [CMPL 🟠]

NLP models trained on document metadata and structure can be combined with text analysis to improve classification accuracy.

STEPS

  1. Use document text (abstracts) as features to train a supervised machine learning model. The labeled data (documents with sector tags) will serve as training data, and the model can predict the missing sector tags for unlabeled documents.
  2. TEXT preprocessing (e.g. tokenization, lemmatization, stopword removal, TF-IDF)
    • Convert the processed text into a numerical format using Term Frequency-Inverse Document Frequency (TF-IDF), which gives more weight to terms that are unique to a document but less frequent across the entire corpus.
  3. Define data features, e.g.
    • Document Length: Public sector documents might be longer, more formal.
    • Presence of Certain Keywords: Use specific keywords that correlate with either the public or private sector.
    • Sector Tags: In documents where the “sector tag” is present, you can use it as a feature for training.
  4. Predicting Missing Sector Tags (Classification):
    • Use models like Logistic Regression (for binary classification, e.g. public vs. private) or Random Forest / XGBoost (for a more complex tagging scheme, e.g. multiple sector categories).
    • Cross-validation: Ensure the model generalizes well by validating with the documents that already have the sector tag filled in.
    • Evaluate the model: Use metrics like accuracy, precision, recall, and F1 score to evaluate the model’s performance.
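Steps 1-3 above could be sketched in R with the recipes/textrecipes packages (assumed installed; the docs data frame below is an invented stand-in for labeled documents):

```r
library(tibble)
library(recipes)
library(textrecipes)  # step_tokenize(), step_tfidf()

# Tiny invented training set: text + known sector label
docs <- tibble(
  sector   = factor(c("public", "public", "private", "private")),
  abstract = c("ministry public health program",
               "government public services reform",
               "private firm market finance",
               "private investment market growth")
)

# Tokenize the text and weight terms by TF-IDF
rec <- recipe(sector ~ abstract, data = docs) |>
  step_tokenize(abstract) |>
  step_tfidf(abstract)

features <- bake(prep(rec), new_data = NULL)
dim(features)  # one row per document, one tfidf_* column per term (+ sector)
```

From here, the features could feed a classifier (step 4), e.g. parsnip::logistic_reg() inside a tidymodels workflow, with vfold_cv() for cross-validation.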

— I could check whether this corresponds to the sector flags in the project metadata

(more missing values, but more objective!)

Topic modeling algorithms with Latent Dirichlet Allocation (LDA)

Topic modeling algorithms like Latent Dirichlet Allocation (LDA) can be applied to automatically uncover underlying themes within a corpus. The detected topics may highlight key terms or subject areas that are strongly associated with either the public or private sector.
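A minimal LDA sketch with the topicmodels package (assumed installed; the word counts are invented for illustration):

```r
library(dplyr)
library(tidytext)     # cast_dtm(), tidy()
library(topicmodels)  # LDA()

# Invented word counts per document
word_counts <- tibble(
  doc  = rep(c("d1", "d2"), each = 3),
  word = c("health", "clinic", "vaccine", "market", "firm", "finance"),
  n    = c(5, 3, 2, 4, 3, 3)
)

dtm <- word_counts |> cast_dtm(doc, word, n)
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# Per-topic word probabilities (beta); top words characterize each topic
tidy(lda, matrix = "beta") |>
  group_by(topic) |>
  slice_max(beta, n = 3)
```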

Named Entity Recognition using CleanNLP and spaCy

NER is especially useful for analyzing unstructured text.

NER can identify key entities (organizations, people, locations) mentioned in the text. By tracking which entities appear frequently (e.g., government agencies vs. corporations), it’s possible to categorize a document as more focused on the public or private sector.
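A sketch of entity extraction with cleanNLP’s spaCy backend (this assumes spaCy and an English model are installed, as in the Set up section; the example sentence is invented):

```r
library(cleanNLP)

# Requires spacyr/spaCy plus a language model (see Set up above)
cnlp_init_spacy(model_name = "en_core_web_sm")

txt <- data.frame(
  doc_id = 1,
  text = "The World Bank approved a loan to the Ministry of Finance of Kenya."
)

anno <- cnlp_annotate(txt)

# With the spaCy backend, cnlp_annotate() also returns an entity table
anno$entity
# expect rows flagging e.g. ORG ("World Bank", "Ministry of Finance") and GPE ("Kenya")
```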

— Summarise the tokens by parts of speech

# Initialize the spacy backend
cnlp_init_spacy() 

# Then, from the terminal, render and open this page:
# quarto render analysis/01b_WB_project_pdo_anal.qmd --to html
# open ./docs/analysis/01b_WB_project_pdo_anal.html